Factors that Influence the Value of the Coefficient of Determination in Simple Linear and Nonlinear Regression Models
Authors: J. A. Cornell and R. D. Berger
Abstract
Cornell, J. A., and Berger, R. D. 1987. Factors that influence the value of the coefficient of determination in simple linear and nonlinear regression models. Phytopathology 77:63-70.

In the fitting of linear regression equations, the coefficient of determination ($R^2$) is one of the most widely used statistics to assess the goodness-of-fit of the equation. Its value, however, is affected by several factors, some of which are associated more closely with the data collection scheme or the experimental design than with how close the regression equation actually fits the observations. These design factors are: the range of values of the independent variable ($X$), the arrangement of $X$ values within the range, the number of replicate observations ($Y$), and the variation among the $Y$ values at each value of $X$. Another little-known fact is the effect on $R^2$ of the ratio of the slope of the fitted equation to the estimated standard error of the observations. In nonlinear model fitting, the value of $R^2$ is best determined by calculating the proportion of the total variation in the observations that cannot be explained by the fitted model and subtracting this proportion from one. Several statistics that are analogous to the standard formula for $R^2$ in the linear regression case are given and determined to be inappropriate in the nonlinear case. The use of $R^2$ alone as a model-fitting criterion is often risky and other statistics should be used to assess the goodness of the model when responses from quantitative treatments are analyzed by regression techniques.

Additional key words: coefficient of determination, residuals, standard error.

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. § 1734 solely to indicate this fact.
©1987 The American Phytopathological Society

Linear regression is a commonly used statistical analysis in plant pathology. It has been used, for example, to determine inoculum density/disease intensity relationships (5), survival of pathogens over time (16), growth, sporulation, and infection of pathogens under different environments (9,10), model testing (8), and disease intensity/crop loss relationships (1). Nonlinear regression is used frequently to fit disease proportions over time to various growth models (2,12), disease prediction from environmental parameters (8), crop loss estimation from disease intensity (13), growth, sporulation, and infection of a pathogen with temperature (3), and the relationship of disease intensity to size of experimental plots (7) or to calcium carbonate concentration (4).

For both linear and nonlinear regression, the coefficient of determination is possibly the statistic used most often to assess the goodness-of-fit of empirical models fitted to data. This is because the value of $R^2$ is provided by every current computer program for regression analysis. Nearly every published article, in which regression analysis was performed, lists the $R^2$ associated with each equation fitted. The appropriateness of $R^2$ to assess the goodness of a fitted model is under investigation (11) and, until alternative measures are suggested, it is imperative that the meaning of $R^2$ and the factors that influence it be understood.

In the fitting of regression models, researchers occasionally raise one or the other of the following two questions when they discover the value of $R^2$ is extremely low for their model: Why is $R^2$ so low when the equation seems to fit the data very well? What is the appropriate method to calculate $R^2$ to determine the goodness-of-fit of a nonlinear model, e.g., exponential models or power functions?

In this article, to address the first question we identify some of the factors in a data set that lower the value of $R^2$. Our purpose in singling out these factors is twofold: first, to acquaint users of regression techniques of the potential pitfalls that result from relying too heavily on $R^2$ as a model closeness criterion, and second, to point out that corrective actions to obtain a high $R^2$ value often are contrary to the principles of good experimental design.

We shall answer the second question by listing several analogous statistics to $R^2$ that are sometimes provided by current computer programs for regression analysis.

METHODS

Artificial data sets were generated and linear or nonlinear models were fitted by least-squares regression either by hand calculation or by the Statistical Analysis System package (15), using the facilities of the Northeast Regional Data Center of the State University System of Florida in Gainesville.

RESULTS AND DISCUSSION

Factors that affect $R^2$ in the fitting of simple linear regression equations. In the simple linear regression equation, $Y_i = a + bX_i + e_i$, $Y_i$ is the $i$th observation of the dependent variable and $X_i$ is the value of the independent variable at which $Y_i$ is observed. The quantities $a$ and $b$ are unknown parameters that represent the intercept and slope of the regression line, respectively. The random error associated with $Y_i$ is termed $e_i$. The usual assumptions regarding the errors are that, in a population of $N$ values of $Y_i$, the random errors ($e_i$) have zero mean, a common variance ($\sigma^2$), and are independent of one another.

To illustrate the calculations that are required in the analysis of a fitted regression equation, the simple linear regression equation ($Y_i = a + bX_i + e_i$) is fitted to each of two data sets denoted as A and B. The observations ($Y_i$) are the same in data sets A and B but the ranges of $X_i$ are different (Table 1). The plots of the fitted regression equations are shown in Figure 1.

Included among the entries in Table 1 are the predicted responses ($\hat{Y}_i$) at each $X_i$ obtained with the fitted regression equation. The quantity $Y_i - \hat{Y}_i$ represents the difference between the observed value ($Y_i$) and the predicted value ($\hat{Y}_i$) at $X_i$, and this difference is called the residual corresponding to the $i$th observation. The larger the values of the residuals, the less confident one feels about how well the estimated equation fits the observed values. A numerical measure, therefore, of how well the model actually fits the data is the variance of the residuals, $s^2 = SSE/(N-2)$. When the residuals are large, $s^2$ is large. Also, the individual residuals can be plotted against the values of $X_i$ or the values of $\hat{Y}_i$ to ascertain if the linear model is indeed the appropriate choice. In both plots, if the model is correct, the values of the residuals will exhibit random scatter about the line, $Y_i - \hat{Y}_i = 0$, and the approximate scatter is uniform for all values of $X_i$ and/or $\hat{Y}_i$.

The positive square root of $s^2$ (i.e., $s_e$) is called the estimated standard error of the $Y_i$ values about the regression line (6), that is, $s_e = \sqrt{\sum (Y_i - \hat{Y}_i)^2/(N-2)}$ is a function of the residuals, $Y_i - \hat{Y}_i$, about the regression equation. Thus $s_e$ represents a measure of the error with which any observed value of $Y$ is selected from the distribution of $Y$ values at each value of $X$.

In Figure 1, the two plots of $X$ and $Y$ values for sets A and B differ only in the spread or range of $X_i$. In set A, the range is $13 - 1 = 12$ units, whereas in set B the range is $8 - 1 = 7$ units. The different ranges of $X_i$ result in different estimates both for $a$ and $b$ in the two fitted regression equations and also different estimates of the error variance ($s^2$). Because $R^2 = 0.9629$ for the fitted equation with data set A is higher than $R^2 = 0.8575$ with data set B, in spite of the fact that the estimated slope, $b$, is larger with set B than with set A, we are led to believe the equation $\hat{Y}_i = 10.63 + 3.04X_i$ fits the data in set A better than the equation $\hat{Y}_i = 6.00 + 5.58X_i$ fits the data in set B. Before we can determine if indeed this is the case, we need to define $R^2$.

Definition of $R^2$. In the calculation of the summary statistics (Table 1), the quantity $SS_Y$ is a measure of the variation in the $Y_i$ values about their mean, $\bar{Y}$. In other words, $SS_Y$ is a measure of the uncertainty in predicting $Y$ without taking $X$ into consideration. Similarly, $SSE$ is a measure of the variation in the values of $Y_i$, or the uncertainty in predicting $Y$, when a regression model containing the variable $X$ is employed. A reasonable measure of the effect of $X$ in explaining the variation in $Y$ is $R^2$, calculated either as

$$R^2 = 1 - SSE/SS_Y \quad \text{or} \quad R^2 = (SS_Y - SSE)/SS_Y \qquad (1)$$

The quantity $(SS_Y - SSE)$ is equal to $b\,SS_{XY}$ and is the regression sums of squares (S.S. Regression), that is, the variation in the $Y_i$ values explained, or accounted for, by the fitted regression equation $\hat{Y}_i = a + bX_i$.

[Figure 1: plot of $Y$ against $X$ for data sets A and B with the fitted lines. Set A: $\hat{Y} = 10.63 + 3.04X$, $R^2 = 0.9629$. Set B: $\hat{Y} = 6.00 + 5.58X$, $R^2 = 0.8575$.]

Fig. 1. Linear regression equation fitted to two data sets (A and B) with identical Y values. The different ranges of X cause different estimates of slopes, intercepts, and $R^2$ values.

TABLE 1. Calculations needed to obtain the fitted regression equations and other summary statistics for two data sets

Observations
                          Data set A                                                   Data set B
$Y_i$   $X_i$   $Y_i-\bar{Y}$   $X_i-\bar{X}$   $\hat{Y}_i$   $Y_i-\hat{Y}_i$     $X_i$   $X_i-\bar{X}$   $\hat{Y}_i$   $Y_i-\hat{Y}_i$
15      1       -16.125         -5.75           13.67           1.33              1       -3.5            11.58           3.42
17      2       -14.125         -4.75           16.70           0.30              2       -2.5            17.17          -0.17
20      3       -11.125         -3.75           19.74           0.26              3       -1.5            22.75          -2.75
18      4       -13.125         -2.75           22.78          -4.78              4       -0.5            28.33         -10.33
43      9        11.875          2.25           37.96           5.04              5        0.5            33.92           9.08
42      10       10.875          3.25           40.99           1.01              6        1.5            39.50           2.50
45      12       13.875          5.25           47.06          -2.06              7        2.5            45.08          -0.08
49      13       17.875          6.25           50.10          -1.10              8        3.5            50.67          -1.67

Summary statistics                                        Data set A                        Data set B
$\sum Y_i$                                                249.0                             249.0
$\bar{Y}$                                                 31.125                            31.125
$\sum X_i$                                                54.0                              36.0
$\bar{X}$                                                 6.75                              4.5
$\sum (Y_i-\bar{Y})^2 = SS_Y$                             1,526.875                         1,526.875
$\sum (X_i-\bar{X})^2 = SS_X$                             159.50                            42.0
$\sum (X_i-\bar{X})(Y_i-\bar{Y}) = SS_{XY}$               484.25                            234.5
$\sum (Y_i-\hat{Y}_i)^2 = SSE$                            56.67                             217.58
Estimate of the intercept, $a = \bar{Y} - b\bar{X}$       10.63                             6.00
Estimate of the slope, $b = SS_{XY}/SS_X$                 3.04                              5.58
S.S. Regression $= SS_Y - SSE = b\,SS_{XY}$               1,470.21                          1,309.29
Coefficient of determination, $R^2 = 1 - SSE/SS_Y$        0.9629                            0.8575
Estimate of $\sigma^2$: $s^2 = SSE/(N-2)$                 9.44                              36.26
Slope/standard error, $b/s_e$                             0.9894                            0.9267
Fitted regression equation                                $\hat{Y}_i = 10.63 + 3.04X_i$     $\hat{Y}_i = 6.00 + 5.58X_i$
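The arithmetic behind Table 1 and equation 1 can be retraced with a short script. The following is a minimal Python sketch, not the authors' procedure (the article reports hand calculation or the Statistical Analysis System package); the function name `fit_simple_linear` and the dictionary keys are illustrative assumptions. The data are the eight observations listed in Table 1.

```python
from math import sqrt


def fit_simple_linear(x, y):
    """Least-squares fit of Y = a + bX plus the summary statistics of Table 1.

    Hypothetical helper written for illustration; names are not from the article.
    """
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Corrected sums of squares and cross products
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    # Slope and intercept estimates
    b = ss_xy / ss_x
    a = y_bar - b * x_bar

    # Residual sum of squares from the fitted values
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

    s2 = sse / (n - 2)       # estimate of the error variance
    r2 = 1.0 - sse / ss_y    # equation 1, first form
    return {
        "intercept": a,
        "slope": b,
        "ss_y": ss_y,
        "ss_x": ss_x,
        "ss_xy": ss_xy,
        "sse": sse,
        "ss_regression": ss_y - sse,
        "s2": s2,
        "r2": r2,
        "slope_over_se": b / sqrt(s2),
    }


# Observations from Table 1: the Y values are shared, only the X ranges differ.
y_obs = [15, 17, 20, 18, 43, 42, 45, 49]
x_set_a = [1, 2, 3, 4, 9, 10, 12, 13]
x_set_b = [1, 2, 3, 4, 5, 6, 7, 8]

stats_a = fit_simple_linear(x_set_a, y_obs)
stats_b = fit_simple_linear(x_set_b, y_obs)

for label, stats in (("A", stats_a), ("B", stats_b)):
    print(
        f"Set {label}: Y = {stats['intercept']:.2f} + {stats['slope']:.2f}X, "
        f"SSE = {stats['sse']:.2f}, s2 = {stats['s2']:.2f}, R2 = {stats['r2']:.4f}"
    )
```

Rounded to the precision of Table 1, the script reproduces the tabulated intercepts, slopes, SSE, $s^2$, and $R^2$ for both data sets. The slope-to-standard-error ratio computed from unrounded values differs from the tabulated 0.9894 and 0.9267 in the third decimal place; the table values appear to have been obtained from the rounded $b$ and $s^2$.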
Publication date: 2006